Statistical Distortion: Consequences of Data Cleaning
Authors
Abstract
We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion, and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real-world data with a comprehensive suite of experiments and analyses.
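To make the idea of statistical distortion concrete, the sketch below compares two cleaning strategies by how far each moves the data away from a reference distribution. It is only an illustration under stated assumptions: the abstract does not name a distance measure here, so the one-dimensional Wasserstein (earth mover's) distance is an assumed choice, and the function names, the sample values, and the outlier threshold are all hypothetical.

```python
from bisect import bisect_right


def wasserstein_1d(a, b):
    """1-D Wasserstein distance between two empirical samples.

    Computed as the area between the two empirical CDFs.
    Assumption: this distance stands in for "statistical distortion";
    the abstract itself does not specify a distance.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def replace_outliers_with_mean(values, threshold):
    """Hypothetical strategy: replace outliers with the mean of the rest."""
    kept = [v for v in values if v <= threshold]
    mean = sum(kept) / len(kept)
    return [v if v <= threshold else mean for v in values]


def winsorize(values, threshold):
    """Hypothetical strategy: cap outliers at the largest non-outlier."""
    cap = max(v for v in values if v <= threshold)
    return [min(v, cap) for v in values]


def glitch_count(values, threshold):
    """Count remaining glitches (here: values above the threshold)."""
    return sum(1 for v in values if v > threshold)


if __name__ == "__main__":
    reference = [10, 11, 12, 12, 13]   # hypothetical glitch-free sample
    dirty = [10, 11, 12, 12, 500]      # same data with one outlier glitch
    threshold = 100                    # hypothetical outlier cutoff

    strategies = {
        "no cleaning": dirty,
        "mean replacement": replace_outliers_with_mean(dirty, threshold),
        "winsorization": winsorize(dirty, threshold),
    }
    for name, cleaned in strategies.items():
        print(name,
              "| glitches left:", glitch_count(cleaned, threshold),
              "| distortion:", round(wasserstein_1d(reference, cleaned), 3))
```

Both strategies remove the glitch, so glitch improvement alone cannot separate them; the distortion column can. A cost dimension, such as run time or the number of records touched, could be tracked alongside in the same loop.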
Related resources
Combining Quantitative and Logical Data Cleaning
Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the q...
Tunable Distortion Limits and Corpus Cleaning for SMT
We describe the Uppsala University system for WMT13, for English-to-German translation. We use the Docent decoder, a local search decoder that translates at the document level. We add tunable distortion limits, that is, soft constraints on the maximum distortion allowed, to Docent. We also investigate cleaning of the noisy Common Crawl corpus. We show that we can use alignment-based filtering f...
Quantitative Data Cleaning for Large Databases
Data collection has become a ubiquitous function of large organizations – not only for record keeping, but to support a variety of data analysis tasks that are critical to the organizational mission. Data analysis typically drives decision-making processes and efficiency optimizations, and in an increasing number of settings is the raison d’etre of entire agencies or firms. Despite the importan...
ActiveClean: Interactive Data Cleaning For Statistical Modeling
Analysts often clean dirty data iteratively–cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while...
An investigation of validity and reliability of a questionnaire for use-patterns of household cleaning, personal care, and cosmetic products
Background and Objective: Regular use of household cleaning products and cosmetics can result in adverse consequences for human health. Therefore, the knowledge of consumption pattern of these products can help to evaluate the effects and finally control the consequences of their inappropriate application. As there is not an appropriate tool for evaluating public use-pattern for these products ...
Journal title:
- PVLDB
Volume 5, Issue
Pages -
Publication date: 2012